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Abstract: Even if storage was infinite, a data warehouse could not materialize all possible views due to the running time 
and update requirements. Therefore, it is necessary to estimate quickly, accurately, and reliably the size of 
views. Many available techniques make particular statistical assumptions and their error can be quite large. 
Unassuming techniques exist, but typically assume we have independent hashing for which there is no known 
practical implementation. We adapt an unassuming estimator due to Gibbons and Tirthapura: its theoretical 
bounds do not make unpractical assumptions. We compare this technique experimentally with stochastic 
probabilistic counting, LogLog probabilistic counting, and multifractal statistical models. Our experiments 
show that we can reliably and accurately (within 10%, 19 times out 20) estimate view sizes over large data 
sets (1.5 GB) within minutes, using almost no memory. However, only GIBBONS -TIRTHAPURA provides 
universally tight estimates irrespective of the size of the view. For large views, probabilistic counting has a 
small edge in accuracy, whereas the competitive sampling-based method (multifractal) we tested is an order 
of magnitude faster but can sometimes provide poor estimates (relative error of 100%). In our tests, LogLog 
probabilistic counting is not competitive. Experimental validation on the US Census 1990 data set and on the 
Transaction Processing Performance (TPC H) data set is provided. 



1 INTRODUCTION 

View materialization is presumably one of the most 
effective technique to improve query performance of 
data warehouses. Materialized views are physical 
structures that improve data access time by precom- 
puting intermediary results. Typical OLAP queries 
defined on data warehouses consist in selecting and 
aggregating data with queries such as grouping sets 
(GROUP BY clauses). By precomputing many plau- 
sible groupings, we can avoid aggregates over large 
tables. However, materializing views requires addi- 
tional storage space and induces maintenance over- 
head when refreshing the data warehouse. 

One of the most important issues in data ware- 
house physical design is to select an appropriate con- 
figuration of materialized views. Several heuris- 
tics and methodologies were proposed for the ma- 
terialized view selection problem, which is NP- 
hard (Gupta, 1997). Most of these techniques exploit 



cost models to estimate the data access cost using ma- 
terialized views, their maintenance and storage cost. 
This cost estimation mostly depends on view-size es- 
timation. 

Several techniques have been proposed for view- 
size estimation: some requiring assumptions about 
the data distribution and others that are "unassuming." 
A common statistical assumption is uniformity (Gol- 
farelli and Rizzi, 1998), but any skew in the data leads 
to an overestimate of the size of the view. Generally, 
while statistically assuming estimators are computed 
quickly, the most expensive step being the random 
sampling, their error can be large and it cannot be 
bounded a priori. 

In this paper, we consider several state-of-the- 
art statistically unassuming estimation techniques: 
GlBBONS-TlRTHAPURA (Gibbons and Tirthapura, 
2001), probabilistic counting (Flajolet and Martin, 
1985), and LogLog probabilistic counting (Durand 



and Flajolet, 2003). While relatively expensive, unas- 
suming estimators tend to provide a good accuracy. 
To our knowledge, this is the first experimental com- 
parisons of unassuming view-size estimation tech- 
niques in a data warehousing setting. 



2 RELATED WORK 



Haas et al. (Haas et al., 1995) estimate the view- 
size from the histogram of a sample: adaptively, they 
choose a different estimator based on the skew of the 
distribution. Faloutsos et al. (Faloutsos et al., 1996) 
obtain results nearly as accurate as Haas et al., that is, 
an error of approximately 40%, but they only need the 
dominant mode of the histogram, the number of dis- 
tinct elements in the sample, and the total number of 
elements. In sample-based estimations, in the worst- 
case scenario, the histogram might be as large as the 
view size we are trying to estimate. Moreover, it is 
difficult to derive unassuming accuracy bounds since 
the sample might not be representative. However, a 
sample-based algorithm is expected to be an order of 
magnitude faster than an algorithm which processes 
the entire data set. 

Probabilistic counting (Flajolet and Martin, 1985) 
and LOGLOG probabilistic counting (henceforth 
LOGLOG) (Durand and Flajolet, 2003) have been 
shown to provide very accurate unassuming view-size 
estimations quickly, but their estimates assume we 
have independent hashing. Because of this assump- 
tion, their theoretical bound may not hold in practice. 
Whether this is a problem in practice is one of the 
contribution of this paper. 

Gibbons and Tirthapura (Gibbons and Tirtha- 
pura, 2001) derived an unassuming bound (henceforth 
Gibbons -Tirthapura) that only requires pairwise 
independent hashing. It has been shown recently 
that if you have fe-wise independent hashing for k > 
2 the theoretically bound can be improved substan- 
tially (Lemire and Kaser, 2006). The benefit of 
Gibbons -Tirthapura is that as long as the ran- 
dom number generator is truly random, the theoret- 
ical bounds have to hold irrespective of the size of the 
view or of other factors. 

All unassuming estimation techniques in this pa- 
per (LOGLOG, probabilistic counting and GlBBONS- 
TlRTHAPURA), have an accuracy proportional to 
\j\pM where M is a parameter noting the memory 
usage. 



3 ESTIMATION BY 
MULTIFRACTALS 

We implemented the statistically assuming algo- 
rithm by Faloutsos et al. based on a multifractal 
model (Faloutsos et al., 1996). Nadeau and Teo- 
rey (Nadeau and Teorey, 2003) reported competitive 
results for this approach. Maybe surprisingly, given 
a sample, all that is required to learn the multifractal 
model is the number of distinct elements in the sam- 
ple Fq, the number of elements in the sample A^', the 
total number of elements N, and the number of occur- 
rences of the most frequent item in the sample m max . 
Hence, a very simple implementation is possible (see 
Algorithm 1). Faloutsos et al. erroneously introduced 
a tolerance factor e in their algorithm: unlike what 
they suggest, it is not possible, unfortunately, to ad- 
just the model parameter for an arbitrary good fit, but 
instead, we have to be content with the best possible 
fit (see line 9 and following). 



Algorithm 1 View-size estimation using a multifrac- 
tal distribution model. 

1 : INPUT: Fact table t containing N facts 

2: INPUT: GROUP BY query on dimensions 
D u D 2 ,...,D d 

3: INPUT: Sampling ratio < p < 1 

4: OUTPUT: Estimated size of GROUP BY query 

5: Choose a sample in t 1 of size N' — [pN\ 

6: Compute g=GROUP BY(t') 

7: let ra max be the number of occurrences of the most fre- 
quent tuple xi, . . . ,xj in g 

8: let Fq be the number of tuples in g 

9: *<- riogFol 
10: while F < F Q do 
11: P^(m max /N') l / k 

12: ^ Ito CX 1 -(/""I 1 -P)T') 

13: k^k+l 

14: p^(m miix /N) l / k 

15: RETURN: £* = (aK 1 - ~P)T) 



4 UNASSUMING VIEW-SIZE 
ESTIMATION 

4.1 Independent Hashing 

Hashing maps objects to values in a nearly random 
way. It has been used for efficient data structures such 
as hash tables and in cryptography. We are interested 
in hashing functions from tuples to [0,2 L ) where L is 
fixed (L = 32 in this paper). Hashing is uniform if 
P(h(x) =y) = l /2 L for all x,y, that is, if all hashed 



values are equally likely. Hashing is pairwise in- 
dependent if P(h(x\) = y Ah(x 2 ) = z) = P(h{x\) = 
y)P(h(x 2 ) = z) = 1/4 L for all Xi,x 2 ,y,z. Pairwise in- 
dependence implies uniformity. Hashing is fc-wise in- 
dependent ifP(h(xi)=y 1 A---Ah(x k )=y k ) = l/2 kL 
for all Xi,yi. Finally, hashing is (fully) independent if 
it is &-wise independent for all k. It is believed that 
independent hashing is unlikely to be possible over 
large data sets using a small amount of memory (Du- 
rand and Flajolet, 2003). 

Next, we show how 3-wise independent hashing 
is easily achieved in a multidimensional data ware- 
housing setting. For each dimension D,-, we build 
a lookup table 7], using the attribute values of D, 
as keys. Each time we meet a new key, we gener- 
ate a random number in [0,2L) and store it in the 
lookup table 7}. This random number is the hashed 
value of this key. This table generates (fully) inde- 
pendent hash values in amortized constant time. In 
a data warehousing context, whereas dimensions are 
numerous, each dimension will typically have few 
distinct values: for example, there are only 8,760 
hours in a year. Therefore, the lookup table will of- 
ten use a few Mib or less. When hashing a tuple 
xi,X2,--.,Xk in D\ x D 2 x ...Dk, we use the value 
7j (xi) T 2 (x 2 ) © • • • © Tk(x k ) where © is the exclu- 
sive OR operator. This hashing is 3-wise independent 
and requires amortized constant time. Tables 7} can be 
reused for several estimations. 

4.2 Probabilistic Counting 

Our implementation of (stochastic) probabilistic 
counting (Flajolet and Martin, 1985) is given in 
Algorithm 2. Recently, a variant of this algo- 
rithm, LogLog, was proposed (Durand and Flajolet, 
2003). Assuming independent hashing, these algo- 
rithms have standard error (or the standard deviation 
of the error) of 0.78/VM and 1.3/VM respectively. 
These theoretical results assume independent hashing 
which we cannot realistically provide. Thus, we do 
not expect these theoretical results to be always reli- 
able. 

4.3 GlBBONS-TlRTHAPURA 

Our implementation of the GIBBONS -TlRTHAPURA 
algorithm (see Algorithm 4) hashes each tuple only 
once unlike the original algorithm (Gibbons and 
Tirthapura, 2001). Moreover, the independence of the 
hashing depends on the number of dimensions used 
by the GROUP BY. If the view-size is smaller than the 
memory parameter (M), the view-size estimation is 



Algorithm 2 View-size estimation using (stochastic) 
probabilistic counting. 

1 : INPUT: Fact table t containing N facts 

2: INPUT: GROUP BY query on dimensions 

D u D 2 ,...,D d 
3: INPUT: Memory budget parameter M = 2 k 
4: INPUT: Independent hash function h from d tuples to 

[0,2*0. 

5: OUTPUT: Estimated size of GROUP BY query 
6: b ^- M x L matrix (initialized at zero) 
7: for tuple x e t do 

8: x' <— 1lDi,D 2 ,...,D d (x) {projection of the tuple} 
9: y^hix 1 ) {hash V to [0,2 L )} 
10: a = ymodM 

11: i <— position of the first 1-bit in [y/M\ 
12: h uJ - I 
13: A^0 

14: for ae {0,1, ...,M- 1} do 

15: increment A by the position of the first zero-bit in 

bafi,b a ,l,--- 
16: RETURN: M/§2 A I M where ~ 0.77351 



Algorithm 3 View-size estimation using LogLog. 

1 : INPUT: fact table t containing N facts 

2: INPUT: GROUP BY query on dimensions 

D u D 2 ,...,D d 
3: INPUT: Memory budget parameter M = 2 k 
4: INPUT: Independent hash function h from d tuples to 

[0,2 L ). 

5: OUTPUT: Estimated size of GROUP BY query 
6: M <-0,0,...,0 

M 

7: for tuple x € t do 

8: x' <— Tto, ,D 2 D d (x) {projection of the tuple} 

9: y^hix 1 ) {hash y to [0,2 L )} 
10: j <— value of the first k bits of y in base 2 
11: z <— position of the first 1-bit in the remaining L — k 

bits of y (count starts at 1) 
12: Mj <-max(fM),z) 

13: RETURN: a M M2v^i M i where a M « 0.39701 - 
(23l 2 + ln 2 2)/(48M). 



without error. For this reason, we expect GlBBONS- 
TlRTHAPURA to perform well when estimating small 
and moderate view sizes. 

The theoretical bounds given in (Gibbons and 
Tirthapura, 2001) assumed pairwise independence. 
The generalization below is from (Lemire and Kaser, 
2006) and is illustrated by Figure 1. 

Proposition 1 Algorithm 4 estimates the number of 
distinct tuples within relative precision £, with a k- 
wise independent hash for k>2by storing M distinct 
tuples (M > 8k) and with reliability 1 — 8 where 8 is 



Algorithm 4 Gibbons-Tirthapura view-size esti- 
mation^ 

1 : INPUT: Fact table t containing N facts 

2: INPUT: GROUP BY query on dimensions 
D h D 2 ,...,D d 

3: INPUT: Memory budget parameter M 

4: INPUT: fc-wise hash function h from d tuples to [0, 2 L ). 

5: OUTPUT: Estimated size of GROUP BY query 

6: 9A <— empty lookup table 

7: t^O 

8: for tuple x e t do 

9: x' <— TtDi.D 2 ,...,D d { x ) {projection of the tuple} 
10: y <- h(x') {hash x' to [0,2 L )} 
11: j <r- position of the first 1-bit in y (count starts at 0) 
12: if j <t then 
13: <KQ = j 
14: while size(JVf) > M do 
15: ti-t + l 

16: prune all entries in M having value less than / 

17: RETURN:.2'size(3Vf) 
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Figure 1: Bound on the estimation error (19 times out of 
20) as a function of the number of tuples kept in memory 
(M e [128,2048]) according to Proposition 1 for GlBBONS- 
TlRTHAPURA view-size estimation with £-wise indepen- 
dent hashing. 



given by 

8= _^_U + ^ ) 

e k/3 M k/iy e*(2*/2_i)y- 

More generally, we have 

**/ 2 / cx*/ 2 4*/ 2 \ 

- e k/3 M k/2 y ( l _ „)* + a k/2 E k( 2 k/2 _ !) J ■ 

for Ak/M < a < 1 and any k,M > 0. 

In the case where hashing is 4-wise independent, 
we derive a more concise bound. 

Corollary 1 With 4-wise independent hashing, Algo- 
rithm 4 estimates the number of distinct tuples within 
relative precision £ « 5/y/M, 19 times out of 20 for £ 
small. 





US Census 1990 


DBGEN 


# of facts 


2458285 


13977981 


# of views 


20 


8 


# of attributes 


69 


16 


Data size 


360 MB 


1.5 GB 



Table 1: Characteristic of data sets. 



Proof. We start from the second inequality of Propo- 
sition 1 . Differentiating "* 2 ,,. H — ,, - with 

respect to a and setting the result to zero, we get 
3oc 4 £ 4 + 16a 3 - 48a 2 -16 = (recall that 4k/M < 
a < 1). By multiscale analysis, we seek a solution of 
the form a = 1 — az r + o(e r ) and we have that a w 
1 - 1 /2{/372£ 4 / 3 for £ small. Substituting this value 

of a, we have ^ + ak/2 /£ /2 _ l} « 128/24£ 4 . The 
result follows by substituting in the second inequal- 
ity. □ 



5 EXPERIMENTAL RESULTS 

To benchmark the quality of the view-size estima- 
tion against the memory and speed, we have run test 
over the US Census 1990 data set (Hettich and Bay, 
2000) as well as on synthetic data produced by DB- 
GEN (TPC, 2006). The synthetic data was produced 
by running the DBGEN application with scale factor 
parameter equal to 2. The characteristics of data sets 
are detailed in Table 1 . We selected 20 and 8 views re- 
spectively from these data sets: all views in US Cen- 
sus 1990 have at least 4 dimensions whereas only 2 
views have at least 4 dimensions in the synthetic data 
set. 

We used the GNU C++ compiler version 4.0.2 
with the "-02" optimization flag on a Centrino Duo 
1.83 GHz machine with 2 GB of RAM running Linux 
kernel 2.6.13-15. No thrashing was observed. To 
ensure reproducibility, C++ source code is available 
freely from the authors. 

For the US Census 1990 data set, the hashing 
look-up table is a simple array since there are always 
fewer than 100 attribute values per dimension. Oth- 
erwise, for the synthetic DBGEN data, we used the 
GNU/CGI STL extension hash_map which is to be 
integrated in the C++ standard as an unorderecLmap: 
it provides amortized 0(1) inserts and queries. All 
other look-up tables are implemented using the STL 
map template which has the same performance char- 
acteristics of a red-black tree. We used comma sep- 
arated (CSV) (and pipe separated files for DBGEN) 
text files and wrote our own C++ parsing code. 



The test protocol we adopted (see Algorithm 5) 
has been executed for each estimation technique 
(LogLog, probabilistic counting and GlBBONS- 
Tirthapura), group by query, random seed and 
memory size. At each step corresponding to those pa- 
rameter values, we compute the estimated-size values 
of GROUP BYs and time required for their compu- 
tation. For the multifractal estimation technique, we 
computed at the same way the time and estimated size 
for each GROUP BY, sampling ratio value and ran- 
dom seed. 



Algorithm 5 Test protocol. 

1 : for group by query q e Q do 

2: for memory budget m £ M do 

3: for random seed value r £ R do 

4: Estimate the size of GROUP BY q with m mem- 

ory budget and r random seed value 

5: Save estimation results (time and estimated 

size) in a log file 



US Census 1990. Figure 2 plots the largest 95 th - 
percentile error observed over 20 test estimations 
for various memory size M e {16,64,256,2048}. 
For the multifractal estimation technique, we rep- 
resent the error for each sampling ratio p € 
{0.1%, 0.3%, 0.5%, 0.7%}. The X axis represents 
the size of the exact GROUP BY values. This 
95 th -percentile error can be related to the theoreti- 
cal bound for e with 19/20 reliability for GlBBONS- 
TlRTHAPURA (see Corollary 1): we see that this up- 
per bound is verified experimentally. However, the er- 
ror on "small" view sizes can exceed 100% for prob- 
abilistic counting and LogLog. 
Synthetic data set. Similarly, we computed the 
19/20 error for each technique, computed from the 
DDBGEN data set . We observed that the four tech- 
niques have the same behaviour observed on the US 
Census data set. Only, this time, the theoretical bound 
for the 19/20 error is larger because the synthetic data 
sets has many views with less than 2 dimensions. 
Speed. We have also computed the time needed for 
each technique to estimate view-sizes. We do not rep- 
resent this time because it is similar for each tech- 
nique except for the multifractal which is the fastest 
one. In addition, we observed that time do not depend 
on the memory budget because most time is spent 
streaming and hashing the data. For the multifrac- 
tal technique, the processing time increases with the 
sampling ratio. 

The time needed to estimate the size of all 
the views by GlBBONS-TlRTHAPURA, probabilis- 
tic counting and LogLog is about 5 minutes for 
US Census 1990 data set and 7 minutes for the syn- 
thetic data set. For the multifractal technique, all 



the estimates are done on roughly 2 seconds. This 
time does not include the time needed for sampling 
data which can be significant: it takes 1 minute (resp. 
4 minutes) to sample 0.5% of the US Census data set 
(resp. the synthetic data set - TPC H) because the 
data is not stored in a flat file. 



6 DISCUSSION 

Our results show that probabilistic counting and 
LogLog do not entirely live up to their theoretical 
promise. For small view sizes, the relative accuracy 
can be very low. 

When comparing the memory usage of the var- 
ious techniques, we have to keep in mind that the 
memory parameter M can translate in different mem- 
ory usage. The memory usage depends also on 
the number of dimensions of each view. Generally, 
Gibbons -Tirthapura will use more memory for 
the same value of M than either probabilistic counting 
or LOGLOG, though all of these can be small com- 
pared to the memory usage of the lookup tables 7} 
used for 3-wise independent hashing. In this paper, 
the memory usage was always of the order of a few 
MiB which is negligible in a data warehousing con- 
text. 

View-size estimation by sampling can take min- 
utes when data is not layed out in a flat file or in- 
dexed, but the time required for an unassuming es- 
timation is even higher. Streaming and hashing the 
tuples accounts for most of the processing time so for 
faster estimates, we could store all hashed values in a 
bitmap (one per dimension). 



7 CONCLUSION AND FUTURE 
WORK 

In this paper, we have provided unassuming tech- 
niques for view-size estimation in a data warehousing 
context. We adapted an estimator due to Gibbons and 
Tirthapura. We compared this technique experimen- 
tally with stochastic probabilistic counting, LogLog, 
and multifractal statistical models. We have demon- 
strated that among these techniques, only GlBBONS- 
TlRTHAPURA provides stable estimates irrespective 
of the size of views. Otherwise, (stochastic) proba- 
bilistic counting has a small edge in accuracy for rela- 
tively large views, whereas the competitive sampling- 
based technique (multifractal) is an order of mag- 
nitude faster but can provide crude estimates. Ac- 
cording to our experiments, LogLog was not faster 
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Figure 2: 95 th -percentile error 19/20 e as a function of exact view size for increasing values of M (US Census 1990) 



than either GIBBONS -TlRTHAPURA or probabilistic 
counting, and since it is less accurate than probabilis- 
tic counting, we cannot recommend it. There is ample 
room for future work. Firstly, we plan to extend these 
techniques to other types of aggregated views (for ex- 
ample, views including HAVING clauses). Secondly, 
we want to precompute the hashed values for very 
fast view-size estimation. Furthermore, these tech- 
niques should be tested in a materialized view selec- 
tion heuristic. 
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